INN Hotels Project - Supervised Learning: Classification

Index

**1. Problem Statement**

Context

A significant number of hotel bookings are called off due to cancellations or no-shows. The typical reasons for cancellations include change of plans, scheduling conflicts, etc. This is often made easier by the option to do so free of charge or preferably at a low cost which is beneficial to hotel guests but it is a less desirable and possibly revenue-diminishing factor for hotels to deal with. Such losses are particularly high on last-minute cancellations.

The new technologies involving online booking channels have dramatically changed customers’ booking possibilities and behavior. This adds a further dimension to the challenge of how hotels handle cancellations, which are no longer limited to traditional booking and guest characteristics.

The cancellation of bookings impact a hotel on various fronts:

  1. Loss of resources (revenue) when the hotel cannot resell the room.
  2. Additional costs of distribution channels by increasing commissions or paying for publicity to help sell these rooms.
  3. Lowering prices last minute, so the hotel can resell a room, resulting in reducing the profit margin.
  4. Human resources to make arrangements for the guests.

Objective

The increasing number of cancellations calls for a Machine Learning based solution that can help in predicting which booking is likely to be canceled. INN Hotels Group has a chain of hotels in Portugal, they are facing problems with the high number of booking cancellations and have reached out to your firm for data-driven solutions. You as a data scientist have to analyze the data provided to find which factors have a high influence on booking cancellations, build a predictive model that can predict which booking is going to be canceled in advance, and help in formulating profitable policies for cancellations and refunds.

Data Dictionary

The data contains the different attributes of customers' booking details:

Importing the required libraries

**2. Explore the dataset, Data Preprocessing and extract insights using Exploratory Data Analysis.**

Load the input file and understand the structure of the data

Observations:

The Hotel Booking DataFrame has 19 columns as mentioned in the Data Dictionary. Data in each row corresponds to the details of each booking, both cancelled and not-cancelled.

Observations:

Observations:

This is the 5 point summary of 14 numerical columns.

  1. For no_of_adults, the mean is 1.8 and median is 2. Note that the IQR is zero, meaning that most of the values in this field is equal to 2.
  2. For no_of_children, the mean is 0.1 and median is 0. for more than 80% of the booking there are no children. The maximum value is 10.
  3. For no_of_weekend_nights, the mean is 0.8 and median is 1. The IQR is 2 and the maximum value is 7.
  4. For no_of_week_nights, the mean is 2.2 and median is 2. The IQR is 2 and the maximum value is 17.
  5. For required_car_parking_space, the mean is 0.03 and median is zero, 90% of the customers does not require a car parking space
  6. For lead_time, the mean is 85.2 days with a std deviation of 85.9. The median is 57 days. Even though 68% of the customers have booked 3-4 months in advance, we can see outliers that have booked a hotel years in advance.
  7. For arrival_year, there are only two values 2017 and 2018. Most of the bookings for this data is for calender year 2018 as the IQR is zero and median is 2018.
  8. For arrival_month, the values range from 1 to 12. Since the median (8) is closer to third quantile (10) than first quantile (5), most of the data should be for the months - 8,9 and 10.
  9. For arrival_date, values range from 1 to 31. The mean is 15.6 and median is 16.
  10. For repeated_guest, the mean is 0.02 and median is zero, 90% of the customers are not repeated guests.
  11. For no_of_previous_cancellations, the mean is 0.03 and median is zero, 90% of the customers have zero previous cancellations. The maximum value is 13, so there are outliers who have cancelled significant number of times.
  12. For no_of_previous_bookings_not_canceled, the mean is 0.01 and median is zero, 90% of the customers have previous bookings not canceled by the customer prior to the current booking. The maximum value is 58 which points out the outliers.
  13. For avg_price_per_room, the mean is 103.4 euros with a std deviation of 35. The median is 99.45 days. The IQR is 40 euros, so 68% of the booking are priced between 80 and 120 euros. Some of the prices are zero (that needs to be investigated) and the maximum is 540 euros.
  14. For no_of_special_requests, the mean is 0.6 and median is zero, 50% of the customers have no special requests. The IQR is 1 and the maximum is 5.

Observations:

  1. For the column 'Booking_ID', it is the unique for all rows and is not significant for further analysis.
  2. For the column 'type_of_meal_plan', it has 4 unique values - The most frequent one is 'Meal Plan 1', with a frequency of 27835.
  3. For the column 'room_type_reserved', it has 7 unique values - The most frequent one is 'Room_Type 1', with a frequency of 28130.
  4. For the column 'market_segment_type', it has 5 unique values - The most frequent one is 'Online', with a frequency of 23214.
  5. For the column 'booking_status', it has two unique values - 'Not_Canceled' and 'Canceled'. 'Not_Canceled' appears more than 'Canceled'.

Data Preprocessing

There are 14 columns with a numeric data type, but I think only two of them ('lead_time' and 'avg_price_per_room') can be considered as continuous. Rest of the columns can be considersed as categorical.

Missing Value Treatment

Observations:

  1. There are no missing values in the dataset. But that does not mean there are invalid (or unexpected) values.

Lets look for the hotel bookings with average price for room as Zero.

Observations:

  1. Since most of the bookings with average price for room as Zero is Complementary, that is expected. Also sometimes there is a chance that online bookings can be free with coupons or reward points.

Lets look at the category columns to see if all the values are expected.

Observations:

  1. The value counts for the categorical columns looks good.

Let's add a numeric column with booking status 'Not Canceled' as 0 and 'Cancelled' as 1.

Exploratory Data Analysis (EDA)

Univariate Analysis

Starting with the dependent variable booking status

Observations:

  1. The field booking_status has two unique values - 'Not Canceled' and 'Canceled'.
  2. There are 67% of observations with booking status = 'Not Canceled' and 33% of observations with booking status = 'Canceled'

Let's look at independent variables

Observations:

  1. The field 'no_of_adults' has 5 unique values, 2 adults constitute to 72% of the data.
  2. The next frequent is 1 adult with 21%.

Observations:

  1. The field 'no_of_children' has 6 unique values, Zero children constitute to 92.5% of the data.
  2. The next frequent is 1 child with only 4.5%.

Observations:

  1. The field 'no_of_weekend_nights' has 8 unique values, Zero weekend nights constitute to 46.5% of the data.
  2. The next frequents is 1 and 2 nights with 27.5% and 25% respectiviely.

Observations:

  1. The field 'no_of_week_nights' has 10 unique values, 2 week nights constitute to 31.5% of the data.
  2. The next frequents is 1 and 3 nights with 26.1% and 21.6% respectiviely.

Observations:

  1. The field 'type_of_meal_plan' has 4 unique values, Meal Plan 1 constitute to 76.7% of the data.
  2. 14% of customers did not select a meal plan.

Observations:

  1. The field required_car_parking_space has two unique values - '0' and '1'. (No and Yes)
  2. There are 97% of observations with required_car_parking_space = '0' and 3% of observations with required_car_parking_space = '1'

Observations:

  1. The field 'room_type_reserved' has 7 unique values, Room Type 1 constitute to 78% of the data.
  2. 17% of customers opted for Room_Type 4

Observations:

Observations:

  1. The field arrival_year has two unique values - '2017' and '2018'.
  2. There are 82% of observations with arrival_year = '2018' and 18% of observations with arrival_year = '2017'

Observations:

  1. The field arrival_month signifies the 12 months in a year.
  2. There are 14.7% of observations with arrival_month October, followed by September with 12.7%

Observations:

  1. The field arrival_date signifies the 31 days in a month.
  2. Almost every day contribute 2.5% to 4% of data for this field, an expception being 31 (since there is no 31 in every month).

Observations:

  1. The field 'market_segment_type' has 5 unique values, Online booking constitute to 64% of the data.
  2. 29% of customers opted for Offline booking.

Observations:

  1. The field repeated_guest has two unique values - '0' and '1'. (No and Yes)
  2. There are 97.4% of observations with repeated_guest = '0' and 3% of observations with repeated_guest = '1'

Observations:

  1. The field 'no_of_previous_cancellations' has 9 unique values.
  2. 99% of customers have no previous cancellations.

Observations:

  1. 97.7% of customers have no previous booking that are not cancelled.

Observations:

Observations:

  1. The field 'no_of_special_requests' has 6 unique values, Zero special requests constitute to 54.5% of the data.
  2. Remaining 45.5% of customers had atleast one special request.

Bivariate and Multivariate Analysis

Starting with pairplot and heat map of numerical fields

Observations:

  1. It is hard to figure anything out of this pairplot. Let's look into the correlation matrix and heatmap.

Observations (as per the heatmap):

  1. There is a positive correlation (0.44) between lead_time and booking_status. In other words, the bookings with greater lead time has more chances of getting cancelled.
  2. There is a negative correlation (-0.25) between no_of_special_requests and booking_status. Hence the booking with fewer number of special requests are likely to get cancelled.

Bivariate analysis

Observations:

  1. As observed in heatmap, booking with higher lead_time has a higher chance of getting cancelled.
  2. The median lead_time for a not-cancelled booking is less than 50 days where as the median lead_time for a cancelled booking is more than 100 days.

Observations:

  1. As observed in heatmap, the average price of the room does not play a major factor in deciding the booking status.
  2. Being said that, it can be observed that the median avg room price for a not-cancelled booking is less than 100 euros where as the median avg room price for a cancelled booking is more than 100 euros.

Observations:

  1. As observed in heatmap, the no_of_adults does not play a major factor in deciding the booking status.
  2. The cancellations are less for no of adults = 4, but the amount of data for no of adults = 4 is very less to make that conclusion.

Observations:

  1. As observed in heatmap, the no_of_children does not play a major factor in deciding the booking status.
  2. The cancellations are zero for no_of_children = 10, but the there is only one observation for that condition.

Observations:

  1. As observed in heatmap, the no_of_weekend_nights does not play a major factor in deciding the booking status.
  2. All the booking are cancelled for no_of_weekend_nights = 7, but the there is only one observation for that condition.

Observations:

  1. As observed in heatmap, the no_of_week_nights does not play a major factor in deciding the booking status.
  2. All the booking are cancelled for no_of_week_nights equal to 16 and 13, but the there are not many observations for that condition.

Observations:

  1. As observed in heatmap, the type_of_meal_plan does not play a major factor in deciding the booking status.
  2. Bookings have less cancellation for type_of_meal_plan equal to 'Meal Plan 3', but the there are only 5 observations for that condition.

Observations:

  1. As observed in heatmap, the required_car_parking_space does not play a major factor in deciding the booking status.
  2. Bookings have less cancellation for required_car_parking_space equal to 'Yes', but the there are only 1124 observations for that condition.

Observations:

  1. As observed in heatmap, the room_type_reserved does not play a major factor in deciding the booking status.
  2. Bookings have less cancellation for Room types 3, 5 and 7, but the there are only a few hunderds of observations for that condition.

Observations:

  1. As observed in heatmap, the arrival_year does not play a major factor in deciding the booking status.
  2. Bookings are less cancelled for calender year 2017, but the number of observations are also less for calender year 2017 compared to 2018.

Observations:

  1. As observed in heatmap, the arrival_month does not play a major factor in deciding the booking status.
  2. The cancellation of booking for the month is in proportion with the number of bookings for the month.

Observations:

  1. As observed in heatmap, the arrival_date does not play a major factor in deciding the booking status.
  2. The cancellation of booking for the day is in proportion with the number of bookings for the day.

Observations:

  1. As observed in heatmap, the market_segment_type has a minor influence in deciding the booking status.
  2. None of the complementary bookings are cancelled, which is significant.

Observations:

  1. As observed in heatmap, the market_segment_type does not play a major factor in deciding the booking status.
  2. None of the complementary bookings are cancelled, which is significant.

Observations:

  1. As observed in heatmap, the no_of_previous_cancellations does not play a major factor in deciding the booking status.
  2. Most of the cancellations are concentrated for no_of_previous_cancellations = 0.

Observations:

  1. As observed in heatmap, the no_of_previous_bookings_not_canceled does not play a major factor in deciding the booking status.
  2. Most of the cancellations are concentrated for no_of_previous_bookings_not_canceled = 0.

Observations:

  1. As observed in heatmap, the no_of_special_requests does play a major factor in deciding the booking status.
  2. Bookings have less cancellation for no_of_special_requests 3, 5 and 5, but the there are only a few hunderds of observations for that condition.

Multivariate analysis

Observations:

  1. As we can see, the higher the lead time greater has the chance of getting cancelled (1 stands for cancellation). In other words, the bookings with least lead_time has less chance to getting cancelled.
  2. There is no clear relationship between average price per room and booking status, but we can see that the lower the average price, there is less chance of getting cancelled.
  3. Also notice that the average price is low when lead_time is high.

Observations:

  1. As we can see, there are no cancellations for complementary bookings, but since that does not impact the revenue, it can be ignored.
  2. The difference in average lead time between cancelled and not-cancelled bookings is high for offline market segment, meaning the offline bookings are done well in advance and is getting cancelled.
  3. The difference in average lead time between cancelled and not-cancelled bookings is lower for online and corporate market segments, and is almost zero for aviation segment.

Observations:

  1. As we have see earlier, Online market segment contributes more to the price than any segment. Also this segment has the highest number of special requests, ranging from 1 to 5.
  2. Also we can notice that customers with more special requests have less cancellations.
  3. There are no special requests from Aviation market segment and the cancellations are low.
  4. There are fewer special requests for Corporate market segment and the cancellations are low.

**3. Questions on Exploratory Data Analysis(EDA)**

Questions

1. What are the busiest months in the hotel?

Answer: The busiest months are October (14.6%) and September (12.7%).

2. Which market segment do most of the guests come from?

Observations: Most of the guests came from market segment 'Online' (63.9%)

3. Hotel rates are dynamic and change according to demand and customer demographics. What are the differences in room prices in different market segments?

Observations: The average price per room is higer for Online market segment (112 euros) followed by Aviation, Offline and Corporate. The Complementary market segment bookings have a very low price (3 euros), as most of them are free.

4. What percentage of bookings are canceled?

Observations: 32.76 % of the all bookings are cancelled as per the data provided.

5. Repeating guests are the guests who stay in the hotel often and are important to brand equity. What percentage of repeating guests cancel?

Observations: Only 1.72 % of the bookings are cancelled by repeated customers.

6. Many guests have special requirements when booking a hotel room. Do these requirements affect booking cancellation?

Observations: 43.21 % of the bookings are cancelled by customers with special requirements.

Data Pre-processing

Let's encode the categorical columns - type_of_meal_plan, room_type_reserved and market_segment_type.

Note that all the columns have now became numeric for Model prediction.

Outlier Treatment

For now, I have decided not to do outlier treatment as the models are capable of handling outliers.

Splitting up Data for Model Prediction

There are 32.76% of cancelled bookings in original dataset. This ratio is consistent with Training and testing dataset.

Functions for calculating Model Performance

First, let's create functions to calculate different metrics and confusion matrix so that we don't have to use the same code repeatedly for each model.

**4. Building Logistic Regression models**

Logistic Regression model (with sklearn)

Build a Logistic Regression model (with sklearn) with the training data and then check it's performance.

Checking model performance on the training set

Checking model performance on the testing set

Observations:

  1. The training and testing accuracy is about 80%, but the recall are 62.4%. 2. Recall on the train and test sets are comparable.
  2. This shows that the model is giving a generalised result. Let's keep trying to get a higher recall.

Logistic Regression model (with statsmodels)

Build a Logistic Regression model (with statsmodels) with the training data and then check it's performance.

Observations:

Multicollinearity

But these variables might contain multicollinearity, which will affect the p-values. We will have to remove multicollinearity from the data to get reliable coefficients and p-values.

There are different ways of detecting (or testing) multi-collinearity, one such way is the Variation Inflation Factor.

Variance inflation factors measure the inflation in the variances of the regression coefficients estimates due to collinearity that exist among the predictors. It is a measure of how much the variance of the estimated regression coefficient βk is "inflated" by the existence of correlation among the predictor variables in the model.

General Rule of thumb: If VIF is 1 then there is no correlation among the kth predictor and the remaining predictor variables, and hence the variance of β̂k is not inflated at all. Whereas if VIF exceeds 5, we say there is moderate VIF and if it is 10 or exceeding 10, it shows signs of high multi-collinearity. But the purpose of the analysis should dictate which threshold to use.

The variable market_segment_type_Online exhibit high multicollinearity, Lets drop it from the dataset.

Let's remove the insignificant features (p-value>0.05).

We will do the following repeatedly using a loop:

The above columns are the significant columns. Lets create X_train2 and X_test2 with the significant columns.

Now no feature has p-value greater than 0.05, so we'll consider the features in X_train2 as the significant features and hotel_booking_statsmodels_log_reg_model2 as the final model.

Coefficient interpretations:

Converting coefficients to odds:

Coefficient interpretations:

Interpretation for other attributes can be done similarly.

Checking model performance on the training set

Optimal threshold using AUC-ROC curve

Let's use Precision-Recall curve and see if we can find a better threshold

Checking model performance on the testing set

Using model with threshold = 0.317

Using model with threshold = 0.416

**5. Logistic Regression - Model Performance Summary - Train and Test**

Model Performance Summary - Training set

Model Performance Summary - Testing set

Conclusion

  1. All the models are giving a generalized performance on training and test set.
  2. The highest recall is 78.5% on the testing set with 0.317 threshold.
  3. Using the model with sklearn and statsmodels (default threshold) will give a low recall but good precision scores - This model will help the hotel save resources but will incorrectly predict potential non-cancelling customers.
  4. Using the model with 0.317 threshold the model will give a high recall but low precision scores - This model will help the hotel identify non-cancelling customers effectively but the cost of resources will be high.
  5. Using the model with 0.416 threshold the model will give a balance recall and precision score - This model will help the hotel to maintain a balance in identifying potential non-cancelling customers and the cost of resources.

**6. Building Decision Tree models**

Decision Tree model (with sklearn)

Build a Decision Tree model (with sklearn) with the training data and then check it's performance.

Checking model performance on training set

Checking model performance on test set

Model is giving good and generalized results on training and test set.

Visualizing the Decision Tree (sklearn decision tree)

Observations:

Decision Tree model (using GridSearch for Hyperparameter tuning)

Let's see if we can improve our model performance even more.

Checking model performance on training set

Checking model performance on test set

Observations:

Visualizing the Decision Tree (GridSearch for Hyperparameter tuning)

Observations:

Decision Tree model (Cost Complexity Pruning)

Observations:

Plotting total impurity vs effective alpha for training set

Observarions:

Next, we train a decision tree using the effective alphas. The last value in ccp_alphas is the alpha value that prunes the whole tree, leaving the tree, clfs[-1], with one node.

Plotting Number of nodes vs alpha and Depth vs alpha

Accuracy vs alpha for training and testing sets

When ccp_alpha is set to zero and keeping the other default parameters of DecisionTreeClassifier, the tree overfits. As alpha increases, more of the tree is pruned, thus creating a decision tree that generalizes better.

Since accuracy isn't the metric we are looking for, we would try to find a model with high recall

Recall vs alpha for training and testing sets

Observations:

This dataset is giving highest recall for low alpha values, which is unusual. Maximum value of Recall is at 0.000024 alpha, but if we choose decision tree will have a thousand nodes. We would lose the buisness rules and the tree will be hard to understand, instead we can choose alpha 0.00666 and get a good recall.**

Checking model performance on training set

Checking model performance on test set

Observations:

Visualizing the Decision Tree (Cost Complexity Pruned decision tree)

Observations:

**7. Decision tree models - Performance Summary - Train and Test**

Model Performance Summary - Training set

Model Performance Summary - Testing set

Conclusion

  1. The sklearn model is an overfit model, but the other 2 pruned models are giving a generalized performance on training and test set.
  2. The highest recall is 76.7% on the testing set using GridSearch for Hyperparameter tuning.
  3. Using the model with sklearn is overfit and is hard to interpret business rules from it.
  4. Using the model with using GridSearch for Hyperparameter tuning will give a high recall but low precision scores - This model will help the hotel identify non-cancelling customers effectively but the cost of resources will be high.
  5. Using the model with Cost Complexity Pruning will give a balance recall and precision score - This model will help the hotel to maintain a balance in identifying potential non-cancelling customers and the cost of resources.

**8. Actionable Insights and Recommendations**

Insights from the Decision Tree model

  1. Lead time is a major factor that influences the cancellation of a hotel booking. Higher the lead time of booking, higher the chance of cancellation.
  2. Online market segment also influences the cancellation of a hotel booking. If the market segment is online, higher the chance of cancellation than any other segments.
  3. The more the number of special requests are accommodated by hotel, the lesser the chance of cancellation.
  4. Bookings with lower average price per room have lesser chance of getting cancelled. Higher the price, higher the chance of getting cancelled.

Insights from the Logistic Regression model

  1. Repeated Guest has a negative correlation to cancellation. So if the guest is a regular customer, there is a less chance of cancellation.
  2. Market segments Offline and Corporate has negative correlation to cancellation. So if the guest is from Offline or Corporate market segment, there is a less chance of cancellation.
  3. Required_car_parking_space has a negative correlation to cancellation. So if the guest has requested for a parking space, there is a less chance of cancellation.
  4. Just like decision tree model, there is lesser chance of cancellation if the hotel can accommodate special requests from customers.
  5. No_of_previous_cancellations has a positive correlation to cancellation. So if the guest has cancelled before, there are chances that he/she will cancel again.

Business Recommendations:

  1. Even though customers with higher lead time has a good chance of cancellation, INN Hotels should allow customers to book early as filling hotel rooms is also important for revenue. Now the catch is to find the customers who will cancel and have them cancel as early as possible so that the hotel can market that room again at a higher price. INN hotels should send weekly reminders (email and text) to customers about their advance booking and remind them about the consequences of not cancelling the room in advance.
  2. INN hotels should try to accommodate as many special requests from customers. This will drive customer satisfaction and will reduce last minute cancellations.
  3. INN hotels should try to give coupons/discounts for bookings with less lead time as these bookings with lower average prices have less chances of getting cancelled.
  4. INN Hotels should try to negotiate deals with Corporates to have them book the hotel for their staff, as Corporate market segment have less chances of getting cancelled.
  5. INN Hotels should initiate a reward point system to reward regular customers and provide them with room upgrades. Repeated Guests have less chances of cancelling their booking.